PageRank is Precomputed Relevancy Ranking

Mark Leighton Fisher on 2008-05-30T17:04:26

Google's PageRank is precomputed relevancy ranking, where the heavy lifting of actual relevancy ranking is done by us humans. Why is this important? I was re-reading "A new comparison between conventional indexing (MEDLARS) and automatic text processing (SMART)", which lays out how computerized indexing can beat the best manual indexing by:

  • Using a stop-word list;
  • Using a thesaurus (synonyms); and
  • Relevancy ranking.

(It's more complicated than that, but you get the idea.) Relevancy ranking is the hardest part of the indexing job, as there are no clear-cut algorithms for relevancy ranking with both excellent precision (none of the documents you don't want) and excellent recall (all of the documents you want). Google's PageRank works around the difficulty of relevancy ranking by handing the hardest part, the ranking of individual documents, to us humans, whose links to a page act as judgments of its worth. You can get good results from proper metadata, but metadata is useful only in environments where no one has an interest in gaming the metadata. (I wonder if it should be called "The Semantic Intranet"? That's where Semantic Web technologies really make sense to me.)
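To make "precomputed" concrete, here is a minimal Python sketch. It assumes the simplified, publicly described form of PageRank (power iteration with a damping factor), not whatever Google actually runs. The tiny link graph, the document texts, and the names pagerank and search are all made up for illustration. The point is that the scores come from human-made links ahead of time, so the query only has to match terms and sort by a number that already exists.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped score evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


def search(query, documents, rank):
    """Return pages containing every query term, best precomputed score first."""
    terms = query.lower().split()
    hits = [
        page
        for page, text in documents.items()
        if all(term in text.lower().split() for term in terms)
    ]
    return sorted(hits, key=lambda page: rank[page], reverse=True)


# Hypothetical data: who links to whom, and what each page says.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
documents = {
    "a": "perl search tips",
    "b": "perl indexing notes",
    "c": "perl search engine",
    "d": "cooking",
}

# Done once, offline, without knowing any query.
rank = pagerank(links)

# Done per query: term matching plus a lookup of the precomputed scores.
results = search("perl search", documents, rank)
print(results)
```

In this toy graph, page "c" collects links from three other pages, so it sorts ahead of "a" for any query both of them match. Nothing about that ordering depends on the query text, which is also why it can be gamed by anyone who can manufacture links.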

The original paper is worth a read, especially if you work on software that incorporates search — and these days, I suspect that almost any non-embedded program could grow to a point where it incorporates a search mechanism (and an email client, and a web browser — you get the point).


We gave up on search...

Alias on 2008-05-30T18:05:40

... and decided to more or less farm it out to professionals.

So at work, we're deploying a product called Endeca, which does all the insane multi-dimensional graph magic.

Similarly, I noticed DreamHost recommending farming mail out to Google.

It seems that areas like these, where you face some form of non-trivial problem, are the ideal place to either specialise as much as possible or centralise efforts.